A Nested Dissection Approach to Modeling Transport in 
Nanodevices: Algorithms and Applications* 



Modeling nanoscale devices quantum mechanically is a computationally challenging 
problem where new methods to solve the underlying equations are in a dire need. In this 
paper, we present an approach to calculate the charge density in nanoscale devices, within 
the context of the non equilibrium Green's function approach. Our approach exploits 
recent advances in using an established graph partitioning approach. The developed 
method has the capability to handle open boundary conditions that are represented by 
full self energy matrices required for realistic modeling of nanoscale devices. Our method 
to calculate the electron density has a reduced complexity compared to the established 
recursive Green's function approach. As an example, we apply our algorithm to a quantum 
well superlattice and a carbon nanotube, which are represented by a continuum and tight 
binding Hamiltonian respectively, and demonstrate significant speed up over the recursive 
method. 
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1 Introduction 

With the advent of smaller nanoelectronic devices, where quantum mechanics is central to the 
device operation, and new nanomaterials, such as nanotubes, graphene, and nanowires, quan- 
tum mechanical simulations have become a necessity. The non-equilibrium Green's function 
(NEGF) method PQ [21 Ej has emerged as a powerful modeling approach for these nanodevices 
and nanomaterials. The NEGF method is based on the self-consistent coupling of Schrodinger 
and Poisson equations and is designed to capture electron scattering effects with phonons. 
A typical NEGF-based simulation solves three Green's function equations. 
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G^{E) is called the retarded Green's function, describing local density of states and the prop- 
agation of electrons injected in the device, and (G^'(-E))^ its Hermitian conjugate. G^{E), 
the lesser Green's function, represents the electron correlation function for energy level E; 
the diagonal elements of G^{E) represent the electron density per unit energy. G^{E), the 
greater Green's function, represents the hole correlation function for energy level E, which is 
proportional to the density of unoccupied states. I is the identity matrix and H the system 
Hamiltonian. and represent the self-energies due to left and right contact coupling and 
"^Phonon correspouds to the self-energy governing electron-phonon scattering. The matrix 
corresponds to the lesser self-energy and the matrix to the greater self-energy. The Green 
functions are then incorporated in the coupling between the Schrodinger and Poisson equations 
- see PP for further details. The self-consistent solution of the Schrodinger and Poisson equa- 
tions requires to solve many times until consistency is achieved. It is well appreciated that 
the computationally intensive part of this calculation is solving ([T]) for the diagonal element of 
(electron density) and at all energies E. The objective of this paper is to present a 
new algorithm for accelerating the solution of ([T|. 

The recursive Green's function method (RGF) [3 El El [101 EO] is an effective method, often 
used in practice, to compute G^, G^, and G^. For elongated devices, this approach remains 
the most efficient. Recently, the Hierarchical Schur Complement (HSC) method [HI [15] and 
the Fast Inverse using Nested Dissection (FIND) method [HI [12] exploit the nested dissection 
method [5] to exhibit a significant speedup. The key ideas behind these two algorithms are to 
partition the whole matrix A into blocks for an efficient block LU-factorization. This factoriza- 
tion is then re-used to fill in all diagonal blocks of the Green's function and some off-diagonal 
blocks in a specific order. These two algorithms are more efficient than RGF and have reduced 
the operation count down to a multiple of the cost for a block LU-factorization of a sparse 
matrix. Approximate methods, that are efficient in the ballistic limit, exist also. For example, 
the contact block reduction [THl HT] accelerates the computation by using a limited number of 
modes to represent the matrices. The focus of this paper is to develop an exact method that 
works in the presence of scattering. 

For calculating the lesser Green's function G^, advanced algorithms for an arbitrary sparse 
matrix are still in their infanc}|^ The RGF method remains an effective method, especially for 
elongated devices. The extension of FIND [T2] for G*^ yields a reduced asymptotic complexity 
but the constant in front of the asymptotic term hinders the reduction in runtime. Li et al. [13] 
have recently proposed a modification of FIND for a significant speedup but their partitioning 
of the matrix A requires some pre-processing. The contribution of this paper is to present 
an extension of the HSC method for calculating diagonal blocks for G^ with partitions from 
existing graph partitioning libraries (like, for example, the package METIS [9J). 

The rest of the paper is organized as follows. Section 2 will review exact methods and 
discuss differences between previous algorithms and the proposed approach. Section 3 will 
give a mathematical description of the new algorithm. Finally, Section 4 describes numerical 
experiments to highlight its efficiency. The discussion and analysis in this paper will focus 
on two-dimensional problems, while three-dimensional problems will be illustrated in a future 
publication. 

2 Review of Eexact Methods 

Consider a device that can be topologically broken down into layers as shown in Figure [TJ For 
an effective mass Hamiltonian, the blue dots represent grid points of the discretized Green's 
function equation, while, in the case of a tight binding Hamiltonian, the blue dots represent 

^An efficient algoritlim for caiculating is also efficient for computing G^. 
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Figure 1: Nano-device partitioned into Ny layers. Each layer contains grid points. 

When a five-point stencil is used for discretization, the resulting Hamiltonian is a symmet- 
ric block tri-diagonal matrix as shown in Figure [2| where each diagonal block represents the 
Hamiltonian of a layer in Figure [1} The i-th diagonal block of the Hamiltonian represents the 
coupling between grid points / atoms in a layer. The off-diagonal blocks to its left and right 
represents coupling to between layers i and i — 1 and i + 1 respectively. Both the diagonal 
and off-diagonal blocks of the Hamiltonian are sparse in many examples where either an effec- 
tive mass or a tight-binding Hamiltonian represents the dynamics — examples of such device 
include silicon nanowires, nanotubes, and graphene. 

The left and right coupling contacts (indicated in Figure [T]) are two semi-infinite leads 
connected with the device and infinite matrices represent their Hamiltonians. Their respective 
effect can be folded into layer 1 and layer Ny, resulting in dense blocks for the first and the A^j^-th 
diagonal blocks of the self-energy matrices S^^ and S^. The resulting matrix structure of the 
matrix A, defined in ([2]), is shown in Figure |2| Note that the self-energy matrix Sp^^^^^ is set 
to be diagonal at each interior grid point, which may arise due to electron-phonon interaction 
or any other interaction. Relaxing the requirement of diagonal ^^phonon include more realistic 
models of scattering and solving equations ([T]) remains a challenge. 

The most common approach to compute blocks of and is the recursive Green's 
function method [I] • RGF is an algorithm composed of two passes to compute G'' and two 
passes to compute G*-. In both cases, the passes are interpreted as follows: 

1. the first pass marches one layer at a time from left to right along the y-direction and, 
recursively, folds the effect of left layers into the current layer; 

2. the second pass marches one layer at a time from right to left along the y-direction and, 
recursively, extracts the diagonal blocks and the nearest neighbor off-diagonal blocks for 
the final result. 

Numerically, it is essential to notice that the RGF method exploits the matrix sparsity of only 
at the block level, which means that it separates the whole problem into sub-problems of full 
matrix operations. The complexity of this method is, at most, lOA^^A^^ (when N^ < Ny). 
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Figure 2: H (left) and A (right) matrix shape, non-zero entries are highhghted. 

To compute block entries of C, two recent advances, namely FIND [HI |12] and HSC 
[lll[T5], utilize the nested dissection method [5] to exhibit a significant speedup. These methods 
explicitly exploit the sparsity of A via a sparse block LU-factorization of the whole matrix and 
re-use this factorization to fill in all diagonal blocks of the Green's function and some off- 
diagonal blocks in a specific order. FIND and HSC have a strong mathematical component 
and their physical interpretation is less obvious. The main difference between RGF and these 
methods is the replacement of layers of grid points organized along a specific direction with 
arbitrarily- shaped clusters of grid points organized in a binary tree. Such choice allows to fold 
and to extract in any physical direction when following the vertical hierarchy of the binary 
tree. Further details about FIND and HSC can be found in their respective references. Table [T] 
summarizes the complexity of these three state-of-the-art algorithms when computing entries 
in 



Algorithm 


Complexity when = Ny 


Complexity when Nr^ < Ny 


RGF 


O (N^J 


< lON^Ny 


FIND nn ng 


O (iV3) 


< USN^Ny 


HSC [12 ng 


O (iV3) 


< 87N^Ny 



Table 1: Complexity of algorithms to compute diagonal blocks of G^'. 



Even though FIND [TT] and HSC [E] have two distinct mathematical motivations, their 
runtime complexities exhibit the same dominating term. But the constants multiplying these 
asymptotic terms differ greatly. The runtime for FIND [TT] contains a large constant due to 
the usage of thick boundaries between the clusters of grid points (or width-2 separators — a 
width-2 separator is a boundary between clusters that has a thickness of 2 grid points or 2 
atoms). In addition, generating these clusters with thick boundaries is not compatible with 
most existing partitioning libraries. On the other hand, the HSC method use thin boundaries 
between clusters with a thickness of one grid point (or width- 1 separators) and can directly 
exploit partitions from existing partitioning libraries. As a result, to compute diagonal blocks 
for G^ the HSC method [H P- 758] is more efficient than FIND [TI]. 

To compute diagonal entries for G^, only the RGF [1] and FIND [12] methods have been 



extended. Table |2] displays the runtime complexity for computing diagonal blocks of G 



Algorithm 


Complexity when A^^. = Ny 


Complexity when < Ny 


RGF [21J 




O (N^Ny) 


FIND [12J 


< 457A^3 


O {NlNy) 



Table 2: Complexity of algorithms to compute diagonal blocks of G 



The extension of FIND [12] for G*^ still use thick boundaries that result in a large constant 
for the runtime and that are incompatible with existing partitioning libraries. Recently, Li et 
al. [13] have proposed an extension of FIND [HI [12] that enables thin boundaries (width-1 
separators). This new algorithm yields a significant speedup over earlier versions of FIND 
[1111121 . The mathematical motivation remains different from HSC. The clustering part needs 
to identify peripheral sets, an information that is not provided directly by most partitioning 
libraries. 

The contribution of this paper is to present an extension of the HSC method [H] for calcu- 
lating the diagonal blocks for G*^. This extension uses thin boundaries (or width-1 separators) 
and is compatible with existing partitioning libraries — namely, the extension is combined with 
the graph partitioning package METIS [9]. The same extension computes efficiently diagonal 
blocks for G'* in a separate step. For the sake of conciseness, the rest of the paper is focused 
only on computing G^. 

3 Mathematical Description of the Algorithm 

In this section, a detailed mathematical description for the extension of HSC to compute blocks 
of G^ is given. The key ingredients are: 

1. an efficient sparse block LDL^^-factorization of A. The block sparse factorization will 
gather grid points into arbitrarily-shaped clusters (instead of layers, like in RGF). Such 
choice allows to fold and to extract in any physical direction when eliminating entries 
in A. The factorization yields formulas to calculate the diagonal blocks and off-diagonal 
blocks for G^' and G^. Exploiting the resulting algebraic relations results in an algorithm 
with a cost significantly smaller than the full inversion of matrix A. 

2. an appropriate order of operations. The cost of a matrix multiplication BCD depends 
on the order of operations. When B is m x p, C is p x fc, and D is A; x n, (BC) D costs 
2mk{n+p) operations and B (CD) costs 2np{m + k). The order of operations can have a 
large effect when multiplying series of matrices together (which is the case for computing 
entries of G^ and G*-). Furthermore, when working with sparse matrices, one order of 
operations may preserve sparsity, while another may not. 

First, a simple description with three clusters is given. Then the approach is extended to an 
arbitrary number of levels and a multilevel binary tree. 

3.1 Description for a simple case 

The basic idea is to partition the nano-device into three disjoint regions (L, R, S) — see Figure 

m 

Such a partition is easily obtained via the nested dissection, introduced by George [5]. 
Nested dissection divides the system into two disconnected sets and an interface, called the 
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Figure 3: Nanodevice partitioned into two subregions (L, R) and a separator S (L stands for 
Left and R for Right). 



separator S. With this partition, the matrix A can be written as 



A 



Note that matrix A is typically complex symmetric. The block LDL -factorization of A is 
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where A55 is the Schur complement, 

Ass = Ass - ^lsA'llAls - A^^A^^A^^s. 
The matrix satisfies the relation 

= (I - L^) G" + B^L^ with A = LDL'^ (3) 
(described in Takahashi et al. [22] and Erisman and Tinney [4|). The block notation yields 
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The diagonal blocks G^^ and G^^, for the regions L and R, respectively, are computed inde- 
pendently of each other, 
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(where the symmetry of G^ has been exploited). For this simple case, the resulting algorithm 
matches exactly the HSC method [14j . 

For calculating entries in the correlation matrix G^, Petersen et al. [IS] generalized Taka- 
hashi's method by writing 

L^G< (L^)^ = D-iL-iS<L-tD-t. (4) 
Recurrence formulas are described in p^. Our approach utilizes a variant of Q, namely 

G< = (I - L^) G< + D-iL-iS< (GO^ , 
and exploits the symmetry of G*" and the skew-Hermitian property of and G*^, 
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The block LDL^^-factorization of A yields 



G' 



AlIAlsGsj^ A^IAlsGsj^ 



ArrArsG^j^ 



ArrArsG^j^ 



AlIAlsGss 

ArrArsG'^S 















r 








+ 





A"^ 

^RR 













(Ass 







-1 



I 



-A'^ A~i 





I 

-A^ A-i 

^RS-^RR 






I 



Parts of G*" are computed with the previous algorithm, namely, G^^^, GJjjij, G^^, G^^, and 
G^_5. By assumption, is a block-diagonal skew-Hermitian matrix with purely imaginary 
entries. The partial matrix multiplication gives 
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(where the starred blocks are not computed). This relation indicates 



G^s 
Gls 



^ss [ ^ss y^ss 



AJ.sAll'^ll {G^gj^y 



^RS-^RR^RR K^SR) 



-AlIAlsGss + A^^S^^ 



G 



RS — "A^pjjAij.sG^^ + A^]jS^^ {G^sR. 



,t 



,t 



Finally, the diagonal blocks G^^ and G^^, for the regions L and R, respectively, are computed 
independently of each other. 
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(where the skew-Hermitian property of has been exploited). 

This derivation for calculating the correlation matrix differs from the FIND method [T2l 
[13] in several aspects. It uses thin boundaries obtained directly from the nested dissection. This 
extension requires one sparse factorization and one back-substitution, while FIND utilizes only 
sparse factorizations but applied many times with different orderings. The order of operations 
to obtain the diagonal blocks of G^ is also different from the recurrence in Petersen et al. jl8j . 
which uses the sequence 

^ L-^D-iL-is< {L-^y {B-^y {L-'^y , 

while our HSC extension uses 

E< ^ S< (G^^ ^ L-iS< (G^)^ ^ D-iL-iS< (G^')^ ^ L-^D-^L-iS< (G^)^ 

When working with sparse matrices, specific order of operations may result in fewer operations. 
In numerical experiments, the latter ordering was more efficient. 

3.2 Description for a multilevel case 

In this section, the description is extended to an arbitrary number of clusters. 

Even though computing the diagonal for the inverse of a matrix is not equivalent to a sparse 
factorization, both problems benefit from matrix reordering. The multilevel nested dissection, 
introduced by George f^, lends itself naturally to the creation of grid points clusters. Typically, 
nested dissection divides the system into two disconnected sets and an interface, called the 
separator. Then the process is repeated recursively on each set to create a multilevel binary 
tree. 

Based on the hierarchical structure of the tree, let Pi denote the set of all cluster indices j 
such that cluster j is an ancestor of cluster i. For example for Figure |4| P5 is equal to {1,2} and 
Pi5 = {1, 3, 7}. Let Ci denote the set of all cluster indices j such that cluster j is a descendant 
of cluster i. For the partition on Figure |4| C4 is equal to {8, 9} and C3 = {6, 7, 12, 13, 14, 15}. 
Note that a cluster may or may not have a direct coupling in the matrix A to any of its ancestors 
or descendants. 

Once the partition is set, the algorithm may be separated into two distinct parts: compu- 
tation of G^ and computation of G^. 

Computation of blocks for G'' 

In the binary tree, the levels are labeled from bottom to top, where level 1 contains all clusters 
at the end of the tree and level L contains only the original separator. For simplicity of 
presentation, let A*^'^ denote the matrix transformed from A after folding all the clusters up to 
level /. Note that A^*^^ is set to A and A^^~^) is block diagonal. 

The computation of blocks for G*^ involve three steps: folding the lower level clusters unto 
the higher ones, inversion of the matrix for the main separator, and extracting of the diagonal 
blocks for the current level from blocks on higher level. 

The algorithm for the first step goes as follows: 

• For / = 1 up to L — 1, 

- AW = A('-i) 

— For all the clusters i on level I, 
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Figure 4: Example of a multilevel partition. 



a(0 



Af I for all j in Pi 



* A^l = A'll + ^'ijK^i for all j and k in 

* A^l'^. = (aJJ)^ for all j and A; in Pi 

* A^J] = and aJ-J] = for all j in 



(0 



.T A (0 



end 



end 



The next step is written as the inversion of A^^ ^-^ , which is symmetric and block diagonal. 
. G(^-i) = (A(^-i))~' 

In practice, the operation requires only the inversion of the block for the top separator. All the 
other blocks have already been inverted during the folding steps. 

Finally, all the diagonal blocks of C are extracted one level at a time. The algorithm goes 
as follows: 

• For I = L — 2 down to 0, 

- G(') = G('+i) 

— For all the clusters i on level I, 

* Gf^ = GJ;] + J2keP, for all cluster indices j in Pi 

* G^jj = ^G-'] j for all cluster indices j in Pj 
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— end 

• end 

The resulting algorithm to compute block entries in C matches exactly the HSC method 

Note that matrix G^"-* is not equal to C because G^^-* is incomplete (see an example in 
the Appendix). However, all the entries in G*-°\ in particular the diagonal entries, match the 
corresponding entries in G''. 

Computation of blocks for G^ 

The algorithm consists of four steps. The first step uses the matrix G'^^-' computed previously. 

• N = S< (G(o))^ 

All the entries in G*^°^ match the corresponding entries in G^ and are sufficient to compute 
diagonal blocks of G*^. The matrix S*^ is typically block diagonal. The matrix N will have the 
same structure and shape as G^''^ The matrix multiplication is done block by block. 

Next the lower level clusters are folded into the higher ones. This step is critical and the 
most time consuming. Let N*^') denote the matrix transformed from N after folding all the 
clusters up to level /. N'^°^ is set to N. 

• For / = 1 up to L — 1, 

— n(') = N('~^) 

— For all the clusters i on level /, 

* Nfl = Nfl + ^l.'Nfl for all j and k in 

— end 

• end 

Similarly to Step 1, Step 3 is a block diagonal multiplication. 

Finally, Step 4 extracts all the diagonal blocks one level at a time. This step is similar to the 
extraction in the G*" algorithm. The operations are the following: 

• For I = L — 2 down to 0, 

— p(0 = pC+i) 

— For all the clusters i on level /, 

* PjJ] = + Y.k&P, *i,fcP£ all cluster indices j in Pi 

t 



* P^'] = — ^Pj-j j for all cluster indices j in Pj 
end 



end 



At the end, matrix P*^°) is not equal to G^ because P*^*^) is incomplete (see an example in 
the Appendix). However, all the entries in P(°), in particular the diagonal entries, match the 
corresponding entries in G^. 
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Comments on the system partition 



The partitioning of the system (or the clustering of points) is the key step for the efficiency of 
this algorithm. The partition should follow two rules: 

1. for clusters within the same level on the binary tree, no interaction is allowed. Operations 
on blocks at the same level are performed independently. 

2. the partition should minimize the size of separators and reduce the clusters down to a 
size manageable for an inversion of the corresponding block matrix. 

The multilevel nested dissection generates a partition that satisfies those rules. It is worth- 
while to note that, as long as the rules stated above are followed, systems with non- uniform 
distribution of points or with a different stencil could be treated correctly. 

The RGF algorithm is included in the previous description with a very specific partition. 
Such a partition is illustrated in Figure |5] 
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Figure 5: Partition generating the RGF algorithm. 

In many cases, self-energy functions add two A^^^ x dense blocks into the input sparse 
matrix A in first and last block (see Figure |2]). One possible partition combines the two contacts 
together with the middle separator 1, as shown in Figure |6j The weakness of such clustering is 
the size of the first separator (or root region). A larger separator increases the computational 
cost spent on this level. More descendant blocks are coupled with this separator and the total 
number of operations will increase dramatically. Another partition that satisfies the previous 
two rules is plotted in Figure [7} 

4 Numerical Experiments 

This section describes numerical experiments on two simple models: a super-lattice structure 
and a graphene nanotube. The algorithm is implemented in a C code that is interfaced with 
METIS [9]. 

4.1 Cost analysis 

First the complexity of the HSC extension is compared numerically to the complexity of RGF. 
A model device is considered where the system Hamiltonian is discretized with a five-point 
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Figure 6: First method to partition system with two dense layers and two ends. 




Figure 7: Partition generated by METIS for system including dense layers at two ends. 
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stencil. The left and right contact self-energies are neglected for this section only. A typical 
partition is plotted in Figure |4j 

The numerical estimate tracks the operation counts step by step for all the matrix multipli- 
cations and matrix inversions throughout the code. For a multiplication of two matrices with 
dimensions i x j and j x k, a. total of ijk operations is added. For inversion of a matrix of 
dimension i x i, operations are counted. 

Figure |8] shows the cost comparison between the HSC extension and RGF for two-dimensional 
square systems with the same number of grid points (or atoms) per direction, i. e. = Ny = N. 
The plot in logarithmic scale indicates that RGF exhibits a complexity of 0{N^), while the 
HSC-extension shows a 0{N^) complexity. 
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Figure 8: Numerical count comparison for our algorithm (blue) and RGF (red). 



4.2 Results 

4.2.1 Super-lattice device 

A super-lattice device is typically a multi-layered energy barriers system. The device is com- 
posed of repeating junctions of energy barriers and wells. To verify the simulation results, 
a two-dimensional system of lengths Ix = 25nm and ly = 20nm is considered and plotted in 
Figure [9j Here the structure has eight barriers, each of width Inm and of height 400meV. The 
wells have a width of Inm. The length of the left flat band region is 2nm and the right fiat 
band region is 3nm long. 

A simulation with a five-point stencil discretization on a grid with spacing dx = dy = O.lnm 
is made for the Fermi energy Ef = 140meV and 500 energy points uniformly distributed between 
to SOOmeV. The density of states, electron density, and current are calculated by the RGF 



method and the extended-HSC method. The output electron density is plotted in Figure 10 



Linear electron density in the ^/-direction is illustrated in Figure 11 The figures indicate that 



the charge distribution in the barrier-well multi-layer junctions are symmetric, as expected. 
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Figure 11: Electron density in y direction of the example super-lattice structure. 
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Next devices of lengths Ix = x O.lnm and ly = Ny x O.lnm are used to compare the two 
algorithms. The number of barriers is kept at 8. The lengths for the two-sides fiat region are 
adjusted according to the lengths of device Ix and ly. The other parameters remain unchanged. 
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Figure 12: Superlattice device NEGF simulation computation time comparison for RGF and 
our methods, all systems grid spacing is O.lnm. (a) Square system of with diagonal self-energy 
matrix; (b) Square system of with dense self-energy matrix; (c) For systems in this plot, the 
length in the x-direction is fixed at 25nm while the length in the y-direction is increased, (d) 
For systems in this plot, the length in the y-direction is fixed at lOnm while the length in the 
x-direction is increased. Dense self-energy matrices are used in (c) and (d) devices. 



In Figure 12 a), diagonal self-energy matrices are used for the left and right contacts. Cal- 
culation times are compared for square systems — i. e. Nx = Ny = N — and plotted in Figure 
[T2|a). As expected, the HSC-extension exhibits smaller CPU times and a complexity of 0{N'^) 
while RGF's complexity is (9(A^^). 

In Figure [l2|(b), CPU times for square systems with dense self-energy matrices for both 
contacts are plotted. Here again the HSC-extension exhibits smaller CPU times. A complexity 
0{N^) for HSC-extension compared with 0{N^) for RGF can be seen. 

Figure [l2|^c) plots CPU times for rectangular devices where Ix = 25nm (or Nx = 250) and 
the length in the y-direction is varied. Dense self-energy blocks for the left and right contacts are 
employed. The implementation of RGF is biased towards the x-direction so that its complexity 
is 0{Ny). The linear trend is clearly present in the plot. For the HSC-extension, similarly 
to the cost of a sparse LDL^ factorization, the computational cost is 0{Ny). Clearly, the 
constant for RGF is larger for this device. 
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To illustrate the dependence of this constant with respect to Nx-, Figure 12 d) plots the CPU 



times when A^j, is fixed and A^^^ is varied. The recorded CPU times illustrate that the RGF 
method has an asymptotic complexity O {Nj^), while the HSC extension exhibits a complexity 
0{N^). So, for rectangular devices, the RGF method has a complexity 0{N^Ny) and the 
HSC-extension a complexity 0{N^Ny). 



4.2.2 Graphene 

Graphene is one of the most promising next-generation materials. Its remarkable electric prop- 
erties, such as high carrier mobility and zero band gaps, generate a rapidly increasing interest 
in the electronic device community. Since 2007, many advances in graphene-based transistor 
development have been reported. [19] 




Figure 13: Graphene hexagonal structure decomposed by tight biding method. Dashed rect- 
angular illustrates one repeating hexagon layer. Dashed lines represent inner four atom layers, 
showing the atoms ordering in tight binding Hamiltonian construction. 

The NEGF simulation of graphene transport is based on tight binding method, which yields 
a four-point-stencil Hamiltonian due to system decomposition of carbon atoms coupling (see 



Figure 13). In the numerical experiments, armchair planar graphene nanoribbon structures are 
simulated. The on-site energy for each carbon atom is and the hopping parameter between 
two nearest carbon atoms is -3.1eV. The Fermi energy is set to 0. The simulation is run for 
only one energy point E = 0.5eV. 



Simulation timings are plotted in Figure 14 for graphene structures of different sizes. To 



minimize the dimension of blocks to invert in RGF, one layer of hexagonal structure is divided 



into four layers (see the dashed lines in Figure 13). Conclusions on the asymptotic complexity 



remain unchanged. Namely, the HSC extension is more efficient than the RGF method for 
square and rectangular structures. The complexity of RGF for four-point stencil behaves as 
0{N^Ny) with the same constant as five-point stencil, which is expected due to the layered 
system partition. In Figure [l4|(a) and (b), for a four-point stencil system, our HSC-extension 
exhibits a complexity of 0{N^) for square system. Figure [l4|(c) and (d) also demonstrate 
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Figure 14: Graphene device NEGF simulation computation time comparison for RGF and the 
HSC extension, based on tight binding theory, (a) Square system of with diagonal self-energy 
matrix, (b) Square system of with dense self-energy matrix, (c) For systems in this plot, the 
number of atoms in the x-direction is set to A^^^ = 250, while the number of atoms in the 
^/-direction is increased, (d) For systems in this plot, the number of atoms in the y-direction is 
set to Ny = 100, while the number of atoms in the x-direction is increased. Dense self-energy 
matrices are used in (c) and (d) devices. 



a complexity growing linearly with A^j^ and, respectively quadratically with A^^;, when Nx, 
respectively Ny, is fixed. The final result is a complexity 0{N^Ny). The constant in front 
of N^Ny for the extended HSC method is smaller for the graphene structures than for the 
superlattice devices. This reduction is explained by a more efficient partitioning of four-point 
stencil systems by METIS . 



5 Conclusion 

In this paper, an approach to calculate the charge density in nanoscale devices, within the 
context of the non equilibrium Green's function approach, is presented. This work exploits 
recent advances to use an established graph partitioning method, namely the nested dissection 
method. This contribution does not require any processing of the partition and it can handle 
open boundary conditions, represented by full self-energy matrices. The key ingredients are an 
efficient sparse block LDL^-factorization and an appropriate order of operations to preserve 
the sparsity as much as possible. The resulting algorithm was illustrated on a quantum well 
superlattice and a carbon nanotube, which are represented by a continuum and tight binding 
Hamiltonian respectively, and demonstrated a significant speed up over the recursive method 
RGF. 
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A Description of the Algorithm for a Three-level Tree 

In order to make the extension more comprehensive, a description of the HSC extension is given 
for a three- level system (see Figure 15). 
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Figure 15: Partition for a three-level system. 

The size of the partition is chosen to make the description as relevant as possible without 
becoming overcomplicated. The first level contains regions 1,2, and 4. The second-level refers 
to region 3 and the top level is the root or region 5. To illustrate the algorithm, steps from 



Section |3.2| are described for this particular device. 

When a five-point stencil is used for discretization, the structure of the matrix A, after 
re-ordering, is 



A 



The blocks An and A44 are dense for representing the contacts with the semi-infinite leads. 



An 





Ai3 





Ai5 





A22 


A23 





A25 


^13 


A^ 


A33 





A35 











A44 


A45 


A"^ 

^15 


A^ 

^25 


A'^ 
^35 


A^ 


A55 



19 



Recall that A'-'^-' is equal to A. Then inner points in regions 1, 2, and 4 are eliminated by 
block Gaussian elimination — the effects of the inner points in regions 1, 2, and 4 are folded 
over their boundary. This first step yields the matrix A*^^^ 
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where the updated matrices are 

A33 - AfgA^/Ais 

A35 - AfgA^/A 
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55 
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After folding the effects of regions 1 and 2, block Agg^ is now a dense block. Figure 
the change of sparsity between A and A*^^). 
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illustrates 
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Figure 16: Sparsity of matrix A and matrix A*^^). 
In the next step, the remaining off-diagonal blocks are eliminated to obtain the matrix A*^^), 
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where the block A55 is 
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Note that several blocks are unchanged, hke An, A22, -^33, and A44. The next step is written 
as the inversion of A^^\ which is symmetric and block diagonal. 
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All the other blocks have been 



This operation requires only the inversion of the block Ag 
inverted during the folding steps. 

Next diagonal blocks of C are extracted one level at a time. Starting from the main root 
(or separator), blocks at level 2 are updated to obtain 
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Finally, blocks for regions 1, 2, and 4 are updated, yielding the matrix G^^^ 
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Blocks for region 4 are satisfying 
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and blocks for region 2 
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Figure 17 displays the sparsity of the resulting matrix G^°\ All the entries in G*^''-* are equal 
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Figure 17: Sparsity of matrix G'-^-'. 

to their corresponding entries in G''. 

The computation of diagonal blocks in G^ are described for the same device. First the 
matrix N is computed, 
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Set N^*^) = N. Next the lower level clusters are folded into the higher ones to obtain the matrix 
N(i) 
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where the updated blocks are 
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For the top level, the block for region 5 is updated 
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The next step is a block-diagonal multiplication 
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Finally, Step 4 extracts blocks one level at a time, defining first 
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Note that P*^^-' is not skcw-Hermitian. However finalized blocks, such as P 
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and 



Pgg"*, satisfy the skcw-Hermitian property. Finally, blocks for regions 1, 2, and 4 are updated, 
yielding the matrix P^'^\ 
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Blocks for region 4 are satisfying 
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All the entries in P(°) are equal to their corresponding entries in G^. 
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